MACHINE LEARNING WITH CALIBRATION TECHNIQUE TO DETECT FRAUDULENT CREDIT CARD TRANSACTIONS

Goal:

Predict the probability of an online credit card transaction being fraudulent, based on different properties of the transactions.

Table Of Contents

1. Setup Environment

The goal of this section is to:

2. Data Overview

Purpose is to:

  1. Load the datasets
  2. Explore the features

The data is broken into two files, identity and transaction, which are joined by “TransactionID”.

Note: Not all transactions have corresponding identity information.

Load the transaction and identity datasets using pd.read_csv()
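A minimal sketch of this step; the file names `train_transaction.csv` and `train_identity.csv` are assumptions (substitute the actual paths), and tiny synthetic stand-ins are written first so the snippet is self-contained:

```python
import pandas as pd

# Tiny synthetic stand-ins for the real files, so the snippet runs end to
# end. The file names are assumed -- substitute the actual paths.
pd.DataFrame({"TransactionID": [1, 2, 3],
              "TransactionAmt": [50.0, 20.0, 99.5]}).to_csv(
    "train_transaction.csv", index=False)
pd.DataFrame({"TransactionID": [1, 3],
              "DeviceType": ["mobile", "desktop"]}).to_csv(
    "train_identity.csv", index=False)

# Load the transaction and identity datasets.
train_transaction = pd.read_csv("train_transaction.csv")
train_identity = pd.read_csv("train_identity.csv")
print(train_transaction.shape, train_identity.shape)
```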

Identity Data Description

Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions. They're collected by Vesta’s fraud protection system and digital security partners. (The field names are masked and pairwise dictionary will not be provided for privacy protection and contract agreement)

Categorical Features:

Transaction Data Description

3. Optimize Memory Used by Data

Several features hold data that would fit in a smaller data type than the one currently used, so the dataframe occupies more memory than necessary. Reducing the memory used by the data therefore pays off. This section defines a function that downcasts the data type of each feature.

Memory occupied by the dataframe (in mb)

Certain features occupy more memory than what is needed to store them. Reducing the memory usage by changing data type will speed up the computations.

Let's create a function for that:

Use the defined function to reduce the memory usage
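A sketch of such a function, assuming only integer and float columns need downcasting; shown on a small synthetic dataframe:

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    """Downcast numeric columns to the smallest dtype that can hold them."""
    start = df.memory_usage(deep=True).sum() / 1024**2
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
    end = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Memory: {start:.4f} MB -> {end:.4f} MB")
    return df

df = pd.DataFrame({"a": np.arange(1000, dtype="int64"),
                   "b": np.random.rand(1000)})
df = reduce_mem_usage(df)
print(df.dtypes.tolist())  # values 0..999 fit in int16, floats in float32
```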

4. Basic Data Stats

Before attempting to solve the problem, it's very important to have a good understanding of data.

The goal of this section is to:

Shape of dataframe

The identity dataset has 144,233 rows and 41 columns.

The transaction dataset has 590,540 rows and 394 columns.

Check how many transactions have ID info

Summary of dataframe

By looking at the summary of the datasets, it's clear there are a lot of missing values.

Let's get missing value stats and various other stats of columns in dataframe.

Stats on Transaction Dataset

Check class imbalance
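A quick way to check the imbalance, using a synthetic stand-in for the `isFraud` target column:

```python
import pandas as pd

# Synthetic stand-in for the isFraud target column (4% positive class).
y = pd.Series([0] * 96 + [1] * 4, name="isFraud")

counts = y.value_counts()
ratios = y.value_counts(normalize=True)
print(counts.to_dict())  # absolute counts per class
print(ratios.to_dict())  # class proportions
```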

Inferences:

A lot of interesting things can be observed here:

5. Data Preprocessing for EDA

The goal of this section is to:

Let's start with the first task to merge datasets to form one.

Merge the datasets

Get dimensions of training dataset

Since a left join was performed on the transaction dataset, the number of rows is the same as in the transaction dataset.
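The merge step can be sketched as follows on toy frames; a left join on `TransactionID` keeps every transaction row, with NaNs where no identity record exists:

```python
import pandas as pd

transaction = pd.DataFrame({"TransactionID": [1, 2, 3, 4],
                            "TransactionAmt": [10.0, 25.5, 7.0, 99.0]})
identity = pd.DataFrame({"TransactionID": [2, 4],
                         "DeviceType": ["mobile", "desktop"]})

# Left join: every transaction survives; identity fields are NaN
# for transactions with no identity record.
train = transaction.merge(identity, on="TransactionID", how="left")
print(train.shape)                       # same row count as `transaction`
print(train["DeviceType"].isna().sum())  # rows lacking identity info
```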

Add missing flag

Clean Data

Let's drop the columns which may not be useful for our analysis

Create a missing-value flag column for the columns we are dropping that have more than 90% missing values; there might be a specific pattern linking missing values to a transaction being fraudulent.

Remove the columns which don't have any variance
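The cleaning steps above (flag, then drop, high-missing columns; drop zero-variance columns) can be sketched on a toy dataframe; the column names are illustrative and the 90% threshold follows the text:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_missing": [np.nan] * 19 + [1.0],  # 95% missing
    "constant": [7] * 20,                     # zero variance
    "useful": list(range(20)),
})

# Flag, then drop, columns with > 90% missing values -- the missingness
# pattern itself may correlate with fraud.
high_missing = [c for c in df.columns if df[c].isna().mean() > 0.9]
for c in high_missing:
    df[c + "_missing_flag"] = df[c].isna().astype("int8")
df = df.drop(columns=high_missing)

# Drop columns with a single unique value (no variance).
no_variance = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=no_variance)
print(df.columns.tolist())
```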

Filter the dataset with only good columns

Get dimensions of training dataset

Create date features

Let's create date features from the TransactionDT feature
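One way to derive such features, assuming `TransactionDT` is an offset in seconds and taking 2017-12-01 as an assumed (not confirmed) reference date, so day-level features may be shifted:

```python
import pandas as pd

# TransactionDT is a second offset from an unknown reference date.
# 2017-12-01 is an assumption; day-of-month/week values may be shifted.
ref_date = pd.Timestamp("2017-12-01")
df = pd.DataFrame({"TransactionDT": [86400, 172800, 90000]})

df["TransactionDate"] = ref_date + pd.to_timedelta(df["TransactionDT"],
                                                   unit="s")
df["day_of_month"] = df["TransactionDate"].dt.day
df["day_of_week"] = df["TransactionDate"].dt.dayofweek
df["hour"] = df["TransactionDate"].dt.hour
print(df[["day_of_month", "day_of_week", "hour"]].values.tolist())
```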

6. Exploratory Data Analysis

Exploratory data analysis is an approach to analyzing or investigating data sets to find patterns and to see whether any of the variables can help explain or predict the target variable.

Visual methods are often used to summarise the data. Primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing tasks.

The goal of this section is to:

Check distribution of target variable

Let's check the distribution of the target class using a bar plot, and check the proportion of transaction amounts that are fraudulent

Inferences:

Let's explore the Transaction amount further

Check distribution of Transaction Amount

There are certain transactions with very high amounts; let's remove those to check the distribution

Most transactions lie in < $200 range

Transaction amount is right skewed.

Let's look at the log of transaction amount

Inferences:

Product Features

Inferences:

Card Features

Inferences:

Inferences:

P-Email Domain

Inferences:

R-Email Domain

Inferences:

Days of the Month

The reference date is not known and has been assumed, so we can't say for certain that the day number is correct.

Inferences:

Days of the week

The reference date is not known and has been assumed, so we can't say for certain that the day number is correct.

Inferences:

Hour of the Day

Inferences:

Device Type

Inferences:

Columns from identity data

Get column names

7. Statistical Significance test

Chi square test for categorical columns

Calculate odds

The Chi-Square test tells whether the variable as a whole is useful or not.

Odds

Odds Ratio

A higher odds ratio implies a greater chance of fraud in that category.

The farther it is from 1.0 (in either direction), the more important the variable is.
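Both computations can be sketched with `scipy.stats.chi2_contingency` and a crosstab, on synthetic data (the category names are illustrative):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic categorical feature vs. fraud flag.
df = pd.DataFrame({
    "DeviceType": ["mobile"] * 50 + ["desktop"] * 50,
    "isFraud":    [1] * 20 + [0] * 30 + [1] * 5 + [0] * 45,
})

# Chi-square test: is DeviceType associated with fraud at all?
table = pd.crosstab(df["DeviceType"], df["isFraud"])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}")

# Odds of fraud within each category, and the odds ratio between them.
odds = table[1] / table[0]
print(odds.to_dict())
odds_ratio = odds["mobile"] / odds["desktop"]
print(f"odds ratio = {odds_ratio:.2f}")  # farther from 1.0 => more useful
```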

8. ANOVA Test
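A one-way ANOVA on a numeric feature split by class can be run with `scipy.stats.f_oneway`; synthetic samples stand in for the real transaction amounts:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Synthetic TransactionAmt samples for the fraud and non-fraud groups.
amt_fraud = rng.normal(loc=150, scale=30, size=200)
amt_legit = rng.normal(loc=100, scale=30, size=200)

# One-way ANOVA: do the group means of the numeric feature differ by class?
f_stat, p_value = f_oneway(amt_fraud, amt_legit)
print(f"F={f_stat:.1f}, p={p_value:.3g}")  # small p => means differ
```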

EDA Inferences:

9. Feature Engineering

Feature engineering is the process of using domain and statistical knowledge to extract features from raw data via data mining techniques.

These features often help to improve the performance of machine learning models.

The goal of this section is to:

Domain Specific Features

You need to engineer domain-specific features. This can boost predictive power and often yields better-performing models.

Domain knowledge is one of the key pillars of data science. So always understand the domain before attempting the problem.

Replace value by the group's mean (or standard dev)
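This pattern is a one-liner with `groupby(...).transform`; `card1` and the derived column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"card1": ["a", "a", "b", "b"],
                   "TransactionAmt": [10.0, 30.0, 5.0, 15.0]})

# Replace each value by its group's mean; transform keeps row alignment.
# The same pattern works with "std" for the standard deviation.
df["amt_card1_mean"] = df.groupby("card1")["TransactionAmt"].transform("mean")
df["amt_to_mean"] = df["TransactionAmt"] / df["amt_card1_mean"]
print(df["amt_card1_mean"].tolist())
```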

10. Dimensionality Reduction - PCA

When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the “essence” of the data.

Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.

Principal component analysis is a technique for reducing the dimensionality of such datasets, increasing interpretability but at the same time minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance.

Create a list of all the columns on which PCA needs to be performed

Impute missing values in the mas_v columns, then use the minmax_scale function to scale the values in these columns
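The impute-scale-project pipeline can be sketched as below; the `V*` column names and the component count are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import minmax_scale

rng = np.random.default_rng(42)
# Synthetic stand-in for a block of numeric columns with gaps.
X = pd.DataFrame(rng.normal(size=(100, 10)),
                 columns=[f"V{i}" for i in range(10)])
X.iloc[::7, 3] = np.nan  # inject some missing values

# Impute, scale to [0, 1], then project onto a few principal components.
X_imp = SimpleImputer(strategy="mean").fit_transform(X)
X_scaled = minmax_scale(X_imp)
pca = PCA(n_components=3, random_state=42)
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape)
```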

Reduce the memory usage of the dataframe, as a lot of new features have been created

11. Feature Encoding

Encoding is the process of converting data from one form to another. Most machine learning algorithms cannot handle categorical values unless we convert them to numerical values, and the performance of many algorithms varies based on how categorical columns are encoded.

Create a list of variables that need to be encoded using frequency encoding. Let's note down the features that have more than 30 unique values; we will use frequency encoding for these features only.

It's time to encode the variables using frequency encoding
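Frequency encoding itself is a small map from each category to its relative frequency; sketched on a toy `P_emaildomain` column:

```python
import pandas as pd

df = pd.DataFrame({"P_emaildomain": ["gmail.com", "gmail.com", "yahoo.com",
                                     "gmail.com", "aol.com"]})

# Frequency encoding: map each category to how often it occurs.
freq = df["P_emaildomain"].value_counts(normalize=True)
df["P_emaildomain_freq"] = df["P_emaildomain"].map(freq)
print(df["P_emaildomain_freq"].tolist())
```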

Label encoding is another popular technique for handling categorical variables. In this technique, each label is assigned a unique integer based on alphabetical ordering.

Let's reduce the memory usage, as a lot of new columns have been added to the data frame

Tip: Save the train dataframe, then free all memory

12. Data Preprocessing for Model Building

The goal of this section is to:

Drop the columns which may not be useful for model building

Separate the x variables and y variables

Split the dataset into a train set and a test set. The train set will be used to train the model; the test set will be used to check the model's performance.
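A sketch of the split with `train_test_split`; stratifying on `y` keeps the rare fraud class at the same proportion in both sets (the 80/20 split is an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)  # ~5% positive, like fraud

# Stratify so both splits keep the rare-class proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(X_train.shape, X_test.shape)
```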

13. Model Building

Finally, model building starts here.

The goal of this section is to:

14. XGBoost Classifier

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework and provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way.

Let's use the model to get predictions on the test dataset. We will look at both the predicted class and the predicted probability to evaluate the model's performance.

15. Evaluation Metrics

Concordance

All evaluation metrics

16. Capture Rates and Calibration Curve

Divide the data into 10 equal bins as per the predicted probability scores. Then compute the percentage of all target-class-1 transactions captured in each bin.

Ideally, the proportion should decrease as we go down every bin. Let's check it out
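The binning and capture-rate computation can be sketched with `pd.qcut` on synthetic scores (here the higher the bin number, the higher the predicted probability):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic scores: frauds tend to get higher predicted probabilities.
y = rng.random(5000) < 0.05
scores = rng.normal(loc=-2 + 3 * y, scale=1.0)
proba = 1 / (1 + np.exp(-scores))  # squash to (0, 1)

df = pd.DataFrame({"y": y.astype(int), "proba": proba})
# 10 equal-frequency bins by score; bin 9 holds the highest scores.
df["bin"] = pd.qcut(df["proba"], 10, labels=False)
captured = df.groupby("bin")["y"].sum()
capture_pct = 100 * captured / captured.sum()
print(capture_pct.round(1).to_dict())  # top bins should capture most frauds
```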

Create validation set

Capture Rates Plot

Gains Table

Ideally the slope should be high initially and should decrease as we move further to the right. This is not really a good model.

Calibration Curve

Calibrate the model
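A sketch using scikit-learn's `calibration_curve` and `CalibratedClassifierCV` (isotonic calibration via cross-validation); the base model and data are illustrative:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.9], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Wrap the base model with an isotonic calibrator fitted via 3-fold CV.
base = RandomForestClassifier(n_estimators=50, random_state=0)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=3)
calibrated.fit(X_tr, y_tr)
proba = calibrated.predict_proba(X_val)[:, 1]

# Reliability curve: observed fraction of positives vs. mean predicted
# probability per bin; a calibrated model hugs the diagonal.
frac_pos, mean_pred = calibration_curve(y_val, proba, n_bins=5)
print(frac_pos.round(2), mean_pred.round(2))
```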

Logistic regression

XGBoost with booster = dart

Let's use the model to get predictions on the test dataset. We will look at both the predicted class and the predicted probability to evaluate the model's performance.

Prediction

Evaluation Metrics

Let's compute various evaluation metrics now

Inferences:

Let's look at LGBM

17. LightGBM

LightGBM is a gradient boosting framework that uses tree based learning algorithms.

It is designed to be distributed and efficient with the following advantages:

Let's use the model to get predictions on the test dataset. We will look at both the predicted class and the predicted probability to evaluate the model's performance.

Evaluation Metrics

Let's compute various evaluation metrics now

Inferences:

18. Random Forest Classifier

Impute missing values, since sklearn algorithms are not designed to handle missing values.

Build and train the Classifier

Predicting on test data
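Imputation, training, and prediction can be chained in one pipeline; the imputation strategy and forest size are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
X[rng.random(X.shape) < 0.1] = np.nan  # inject missing values
y = (rng.random(500) < 0.1).astype(int)

# sklearn's random forest has historically rejected NaNs, so impute first.
model = make_pipeline(SimpleImputer(strategy="median"),
                      RandomForestClassifier(n_estimators=100,
                                             random_state=0))
model.fit(X, y)
proba = model.predict_proba(X)[:, 1]
print(proba[:5].round(3))
```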

Evaluation metrics

19. Handling Class Imbalance

Handle Class Imbalance with Random Oversampler

Imbalanced classes are a common problem in machine learning classification, where there is a disproportionate ratio of observations in each class.

Most machine learning algorithms work best when the number of samples in each class is about equal. This is because most algorithms are designed to maximize accuracy and reduce error.
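imbalanced-learn's `RandomOverSampler` is the usual tool here; the same idea, resampling the minority class with replacement until both classes are equal in size, can be sketched in plain pandas:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=1000),
                   "isFraud": (rng.random(1000) < 0.05).astype(int)})

# Random oversampling: draw minority rows with replacement until the
# classes are the same size (what imblearn's RandomOverSampler does).
majority = df[df["isFraud"] == 0]
minority = df[df["isFraud"] == 1]
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=0)
print(balanced["isFraud"].value_counts().to_dict())
```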

Let's use the model to get predictions on the test dataset. We will look at both the predicted class and the predicted probability to evaluate the model's performance.

Evaluation Metrics

Let's compute various evaluation metrics now

Inferences:

20. Cost-Sensitive Learning with Class Weights

The 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
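The quoted formula can be verified directly on a synthetic label vector:

```python
import numpy as np

# 950 legitimate vs. 50 fraudulent labels.
y = np.array([0] * 950 + [1] * 50)

# 'balanced' class weights: n_samples / (n_classes * np.bincount(y)).
weights = len(y) / (len(np.unique(y)) * np.bincount(y))
print(weights)  # the minority class gets the larger weight
```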

21. Model Calibration

22. Model Tuning

A hyperparameter is a parameter that governs how the algorithm learns the relationships in the data. Its value is set before the learning process begins.

Hyperparameter tuning refers to the automatic optimization of the hyperparameters of an ML model.

Get the best parameters, which correspond to the best model found
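The text does not pin down a tuning method; one common choice is `RandomizedSearchCV`, sketched here on a synthetic dataset with an illustrative search space:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# Illustrative search space; real runs would cover more hyperparameters.
param_dist = {"n_estimators": randint(50, 200),
              "max_depth": randint(3, 10)}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3,
                            scoring="roc_auc", random_state=0)
search.fit(X, y)
print(search.best_params_)       # parameters of the best model found
best_model = search.best_estimator_
```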

Let's use the best model to get predictions on the test dataset. We will look at both the predicted class and the predicted probability to evaluate the model's performance.

Evaluation Metrics

Let's compute various evaluation metrics now

Calibration Curve

Calibrate the model

Inferences:

Hence we can freeze the model.

23. Feature Importance

Feature importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable.

Inferences:

24. Partial Dependence and Individual Conditional Expectations (ICE)

Fit the model

Plot Partial Dependence

Individual Conditional Expectation (ICE) Plot - card2

Partial Dependence and ICE Plot - C13

25. SHAP Values

SHAP values are used to reverse engineer the output of the prediction model and quantify the contribution of each predictor to a given prediction.

You can make a partial dependence plot using shap.dependence_plot. This shows the relationship between the feature and the target, and it automatically includes another feature that the chosen feature frequently interacts with.

Explain a single observation.

Add link = "logit"

Conclusion

The model has been trained and tested, so it can now be used to predict whether a transaction is fraudulent or not.